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Damage recovery is critical for autonomous robots that need 
to operate for a long time without assistance. Most current 
methods are complex and costly because they require antici- 
pating each potential damage in order to have a contingency 
plan ready. As an alternative, we introduce the T-resilience al- 
gorithm, a new algorithm that allows robots to quickly and au- 
tonomously discover compensatory behaviors in unanticipated 
situations. This algorithm equips the robot with a self-model 
and discovers new behaviors by learning to avoid those that 
perform differently in the self-model and in reality. Our algo- 
rithm thus does not identify the damaged parts but it implicitly 
searches for efficient behaviors that do not use them. We evalu- 
ate the T-Resilience algorithm on a hexapod robot that needs to 
adapt to leg removal, broken legs and motor failures; we com- 
pare it to stochastic local search, policy gradient and the self- 
modeling algorithm proposed by Bongard et al. The behavior 
of the robot is assessed on-board thanks to a RGB-D sensor and 
a SLAM algorithm. Using only 25 tests on the robot and an 
loverall running time of 20 minutes, T-Resilience consistently 
leads to substantially better results than the other approaches. 

1. Introduction 

Autonomous robots are inherently complex machines that have 
to cope with a dynamic and often hostile environment. They 
face an even more demanding context when they operate for a 
long time without any assistance, whether when exploring re- 
mote places < |Bellingham and Rajan 2007) or, more prosaic ally, in 
a house without any robotics expert (Prassler and Kosuge, 2008V 
As famously pointed out by Corbato ( 2007^ when designing such 
complex systems, ''[we should not] wonder if some mishap may 
happen, but rather ask what one will do about it when it occurs''. 
In autonomous robotics, this remark means that robots must be 
able to pursue their mission in situations that have not been an- 
ticipated by their designers. Legged robots clearly illustrate this 
need to handle the unexpected: to be as versatile as possible, 
they involve many moving parts, many actuators and many sen- 
sors ^Kajita and Espiau ] 2008) ; but they may be damaged in nu- 
merous different ways. These robots would therefore greatly ben- 
efit from being able to autonomously find a new behavior if some 
legs are ripped off, if a leg is broken or if one motor is inadver- 
tently disconnected (Fig.[l|. 

Fault tolerance and resilience are classic topics in robotics and 
engineering. The most classic approaches combine intensive test- 
ing with redundancy of components ( Vi sinsky et al.} [1994} |Ko-| 
ren and Krishna 2007) . These methods undoubtedly proved their 
usefulness in space, aeronautics and numerous complex systems, 
but they also are expensive to operate and to design. More impor- 
tantly, they require the identification of the faulty subsystems and 
a procedure to bypass them, whereas both operations are difficult 
for many kinds of faults - for example mechanical failures. An- 
other classic approach to fault tolerance is to employ robust con- 
trollers that can work in spite of damaged sensors or hardware 
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(a) Normal state. 



(b) Two legs ripped out. 





(c) One broken leg. 



(d) Two unpowered motors. 



Figure 1: Examples of situations in which an autonomous robot 
needs to discover a qualitatively new behavior to pur- 
sue its mission: in each case, classic hexapod gaits can- 
not be used. The broken leg example (c) is a typical 
damage that is hard to diagnose by direct sensing (be- 
cause no actuator or sensor is damaged). 



ineffi ciencies {Goldberg and Chen 2001} Caccavale and Villani 



[20021 IQu et al.| [2003) |Lin and Chen|T2007t . Such controllers usu- 
ally do not require diagnosing the damage, but this advantage is 
tempered by the need to integrate the reaction to all faults in a 
single controller. Last, a robot can embed a few pre-designed be- 
haviors to cope with anticipated potential failur es {Corner and 



'Hirzinger 2010 Jakimovski and Maehle 2010 Mostafa et al. 



2010 ; S chley er and RusselH|2010) . For instance, if a hexapod robot 
detects that one of its legs is not reacting as expected, it can drop it 
and adapt the position of the other legs accordingly {Jakimovski] 
[and Maehle|[20T0}|Mostafa et al.||20T0) . 

An alternative line of thought is to let the robot learn on its own 
the best behavior for the current situation. If the learning pro- 
cess is open enough, then the robot should be able to discover 
new compensatory behaviors in situations that have not been 
foreseen by its designers. Numerous learning systems have been 
experimented in robotics (for reviews, see|Connell a nd Maha de-| 
' ' p993);|Argal l et al.|<2509);|N guyen-Tuong and P eters (201TJ; 



Kober and Peters (2012)), with different levels of openness and 



various a priori constraints. Many of them primarily aim at auto- 
matically tuning controllers for complex robots jKohl and Stone 
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70(M ; Tedrake e^alj |2005| |Sproewitz et al. 



2008 



Hemker et al. 



2009) whereas only a handful of these systems has been explicitly 
tested in situations in which a robot needs to adapt itself to un 
expected situations iMahdavi and Bentley 2003 Berenson et al. 



2005 : B ongard et aH|2006> 

Finding the behavior that maximizes performance in the cur- 
rent situation is a reinforcement learning problem ^ Sutton and Barto 



{1998[ , but classic reinforcement learning algorithms (e.g. TD- 
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Learning, SARSA, ...) are designed for discrete state spaces fSut-" 
[ton and Barto. . 1998 : Togelius et al.^, 2QQ9j . They are therefore hard 
to use when learning continuous behaviors such as locomotion 
patterns. Policy gradient algorithms < |Kohl and Stone 2004 Peters 



and Schaalj [2008} [Peters} |2010) are reasonably fast learning algo- 



rithms that are better suited for robotics (authors typically report 
learning time of 20 minutes to a few hours), but they are essen- 
tially limited to a local search in the parameter space: they lack 
the openness of the search that is required to cope with truly un- 
foreseen situations. Evolutionary Algorithms (EAs) (Deb) [2001': 
|De Jong} |2006 ) can optimize reward functions in larger, more 
open search spaces (e.g. automatic design of neural networks, 
design of stru ctures) ^Grefenstette et aL 1999 Heidrich-Meisner 



and I gel"2009^'Togelius et al.}|2009}|Doncieux et al.}|2011}|Hornby 
et al. 2011 ; Whiteson 2012|, but this openness is counterbalanced 
by substantially longer learning time (according to the literature, 
2 to 10 hours for simple robotic behaviors). 

All policy gradient and evolutionary algorithms spend most 
of their running time in evaluating the quality of controllers by 
testing them on the target robot. Since, contrary to simulation, re- 
ality cannot be sped up, their running time can only be improved 
by finding strategies to evaluate fewer candidate solutions on the 
robot. In their ''starfish robot'' project, [Bongard et al. |(2006) de- 
signed a general approach for resilience that makes an impor- 
tant step in this direction. The algorithm of Bongard et al. is 
divided into two stages: (1) automatically building an internal 
simulation of the whole robot by observing the consequences of 
a few elementary actions (about 15 in the demonstrations of the 
paper) - this internal simulation of the whole body is called a 
self-mode^ (Metzinger , 2004 , 2007} [Vogeley et al.} [1999} [Bongard 



^Far||2006}|Holland and Goodman} |2003} [Hoffmann et al.}|2010F 

(2) launching in this simulation an EA to find a new controller. In 
effect, this algorithm transfers most of the learning time to a com- 
puter simulation, which makes it increasingly faster when com- 
puters are improved (Moore. . ^197 5). 

Bongard 's algorithm highlights how mixing a self-model with 
a learning algorithm can reduce the time required for a robot to 
adapt to an unforeseen situation. Nevertheless, it has a few im- 
portant shortcomings. First, actions and models are undirected: 
the algorithm can "waste" a lot of time to improve parts of the 
self-model that are irrelevant for the task. Second, it is compu- 
tationally expensive because it includes a full learning algorithm 
(the second stage, in simulation) and an expensive process to se- 
lect each action that is tested on the robot. Third, there is often 
a "reality gap" between a behavior learned in simulation and the 
same behavior on the target robot ^Jakobi et"aL [1995}[Zagal et aT} 
2004 Koos et al. 2012[ , but nothing is included in Bongard's al- 
gorithm to prevent such gap to happen: the controller learned in 
the simulation stage may not work well on the real robot, even if 
the self-model is accurate. Last, one can challenge the relevance 
of calling into question the full self-model each time an adapta- 
tion is required, for instance if an adaptation is only temporarily 
useful. 

In the present paper we introduce a new resilience algorithm 
that overcomes these shortcomings while still performing most 
of the search in a simulation of the robot. Our algorithm works 
with any parametrized controller and it is especially efficient on 
modern, multi-core computers. More generally, it is designed for 
situations in which: 

• behaviors optimized on the undamaged robot are not effi- 
cient anymore on the damaged robot (otherwise, adaptation 
is useless) and qualitatively new behavior is required (other- 
wise, local search algorithms should perform better); 



^ Following the literature in psychology I Metzinger 2004 2007 Vogeley et al., 1999 1 
and artificial intelligence I Bongard et al. 2006 Holland and Goodraan 2003 1, we 
define a self-model as a forward, internal model ot the whoie body that is acces- 
sible to introspection and instantiated in a model of the environment. In the 
present paper, we only consider a minimal model of the environment (a horizon- 
tal plane). 



• the robot can only rely on internal measurements of its state 
(truly autonomous robots do not have access to perfect, ex- 
ternal sensing systems); 

• some damages cannot be observed or measured directly 
(otherwise more explicit methods may be more efficient). 

Our algorithm is inspired by the "transferability ap- 
proach" (Koos et al. 2012 Mouret et al. 2012| , whose original 



purpose is to cross the "reality gap" that separates behaviors op- 
timized in simulation to those observed on the target robot. The 
main proposition of this approach is to make the optimization 
algorithm aware of the limits of the simulation. To this end, a 
few controllers are transferred during the optimization and a 
regression algorithm (e.g. a SVM or a neural network) is used to 
approximate the function that maps behaviors in simulation to 
the difference of performance between simulation and reality. To 
use this approximated transferability function, the single-objective 
optimization problem is transformed into a multi-objective 
optimization in which both performance in simulation and 
transferability are maximized. This optimization is typically per- 
formed with a stochastic multi-objective optimization algorithm 
but other optimization algorithms are conceivable. 

As this paper will show, the same concepts can be applied to 
design a fast adaptation algorithm for resilient robotics, leading to 
a new algorithm that we called "T-Resilience" (for Transferability- 
based resilience). If a damaged robot embeds a simulation of it- 
self, then behaviors that rely on damaged parts will not be trans- 
ferable: they will perform very differently in the self-model and 
in reality. During the adaptation process, the robot will thus cre- 
ate an approximated transferability function that classifies behav- 
iors as "working as expected" and "not working as expected". 
Hence the robot will possess an "intuition" of the damages but it 
will not explicitly represent or identify them. By optimizing both 
the transferability and the performance, the algorithm will look 
for the most efficient behaviors among those that only use the re- 
liable parts of the robots. The robot will thus be able to sustain a 
functioning behavior when damage occurs by learning to avoid 
behaviors that it is unable to achieve in the real world. Besides 
this damage recovery scenario, the T-Resilience algorithm opens 
a new class of adaptation algorithms that benefit from Moore's 
law by transferring most of the adaptation time from real experi- 
ments to simulations of a self-model. 

We evaluate the T-Resilience algorithm on an 18-DOFs hexa- 
pod robot that needs to adapt to leg removal, broken legs and 
motor failures; we compare it to stochastic local se arch ^Hoos 



and Stiitzle , 2005 ), policy gradient ^Kohl and Stone 2004 ) and 



Bongard's algorithm (Bongard et al. ]2006} . The behavior on the 



real robot is assessed on-board thanks to a RGB-D sensor coupled 
with a state-of-the-art SLAM algorithm jEndres et ar}[20T2| . 



2. Learning for resilience 

Discovering a new behavior after a damage is a particular case of 
learning a new behavior, a question that generates an abundant lit- 
erature in artificial intelligence since its beginnings ( Turing}[l950^ . 
We are here interested in reinforcement learning algorithms be- 
cause we consider scenarios in which evaluating the performance 
of a behavior is possible but the optimal behavior is unknown. 
However, classic reinforcement learning algorithms are primar- 
ily designed for discrete states and discrete actions ^Sutton and 
[Barto) [1998{ [Peters| [2010^ , whereas autonomous robots have to 
solve many continuous problems (e.g. motor control). Two alter- 
native families of methods are currently prevalent for continuous 
reinforcement learning in robotics (table[T|: policy gradient meth- 
ods and evolutionary algorithms. These two approaches both 
rely on optimization algorithms that directly optimize parame- 
ters of a controller by measuring the overall performance of the 
robot (Fig. [2]); learning is thus here regarded as an optimization of 
these parameters. 
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Figure 2: Principle of resilience processes based on policy gradi- 
ent. Controllers are optimized by measuring rewards on 
the robot. 



2.1. Policy gradient methods 



Policy gradient methods (S utton et al.} 2000 Peters and Schaal 
[2008} [Petersj |2010^) use iterative stochastic optimization algo- 
rithms to find a local extremum of the reward function. The 
search starts with a controller that can be generated at random, 
designed by the user or inferred from a demonstration. The al- 
gorithm then iteratively modifies the parameters of the controller 
by estimating gradients in the control space and applying slight 
changes to the parameters. 

Typical policy gradient methods iterate the following steps: 

• generation of N controllers in the neighborhood of the cur- 
rent vector of parameters (by variating one or multiple pa- 
rameter values at once); 

• estimation of the gradient of the reward function in the con- 
trol space; 

• modification of parameter values according to the gradient 
information. 

These steps are iterated until a satisfying controller is found 
or until the process converges. Policy gradient algorithms essen- 
tially differ in the way gradient is estimated. The most simple 
way is the finite-difference method, which independently esti- 
mates the local gradient of the reward function for each param- 
eter (Kohl and Stone, 2004 : jTedrake et al. |2005) : considering a 
given parameter, if higher (resp. lower) values lead to higher re- 
wards on average on the N controllers tested during the current 
iteration, the value of the parameter is increased (resp. decreased) 
for the next iteration. Such a simple method for estimating the 
gradient is especially efficient when parameters are mostly inde- 
pendent. Strong dependencies between the parameters often re- 
quire more sophisticated estimation techniques. 

Policy gradient algorithms have been successfully applied to 



locomotion tasks in the case of quadruped (Kimura et al. 



^001 



Kohl and Stone] |2004| and biped robots ( Tedrake et aH|2005^ but 



they typically require numerous evaluations on the robot, most of 
the times more than 1000 trials in a few hours (table [l}. To make 
learning tractable, these examples all use carefully designed con- 
trollers with only a few degrees of freedom. They also typically 
start with well-chosen initial parameter values, making them ef- 
ficient algorithms for imitation learning when these values are 
extracted from a demonstration by a human iKober and Peters 
|2010). Recent results on the locomotion of a quadruped robot 
suggest that using random initial controllers would likely require 
many additional experiments on the robot | |Yosinski et al. 2011[ . 
Consistent results have been reported on biped locomotion with 
computer simulations using random initial controllers that make 
the robot fall (Nakamura et alj|2007| (about 10 hours of learning 
for 11 control parameters). 

2.2. Evolutionary Algorithms 



Evolutionary Algorithms (EAs) peb} |2001{ |De Jong} 12006) are 
another family of iterative stochastic optimization methods that 
search for the optima of function jCrefenstette et aL 1999 



models [ learning by l!?.^.^!.!!]!:^."^^' controllers 
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Figure 3: Principle of Bongard's algorithm. (1) A self-model is 
learned by testing a few actions on the damaged robot. 
(2) This self-model is next used as a simulation in which 
a new controller is optimized. 



[Heidrich-Meisner and Igel} |2009[ . They are less prone to local 
optima than policy gradient algorithms and they can optimize 
arbitrary structures (neural networks, fuzzy rules, vector of pa- 
rameters, ...) <|Doncieux e t al. 201 1}|Hornby et al. 2011 Mouret 
[and Doncieux||2012}|Whiteson, 2012^ 

While there exists many variants of EAs, the vast majority of 
them iterate the following steps: 

• (first iteration only) random initialization of a population of 
candidate solutions; 

• evaluation of the performance of each controller of the pop- 
ulation (by testing the controller on the robot); 

• ranking of controllers; 

• selection and variation around the most efficient controllers 
to build a new population for the next iteration. 

Learning experiments with EAs are reported to require many 
hundreds of trials on the robot and to last from two to tens 
of hours (table E As have been applied to quadruped 

I 20TIJ , hexapod 
20061 and hu- 



robots jLLornby et al 
robots i |Zykov et al.} 



2005 




Yosinski et al. 




2004 


Barfoot et al. 





manoids fKatic and Vukobratovic}|2003} [Palmer et al.}j2009l EAs 
have also been used in a few studies dedicated to resilience, in 
particular on a snake-like robot with a damaged body (Mahd avl] 
and Bentley} 2003) (about 600 evaluations/ 10 hours) and on a 
quadrupedal robot that breaks one of its leg ^Berenson et al. 2005} 
(about 670 evaluations /2 hours). 

Aside from these two main types of approaches, several au- 
thors proposed to use other black-box optimization algorithms: 
global methods like Nelder-Mead desce nt ^Weingarten et aLl 
|2004[ , local methods like PowelL s method jSproewitz et al.}[2008t 
or surrogate-based optimization (Hemker et al. 2009). Published 
results are typically obtained with hundreds of evaluations on the 
robot, requiring several hours (table [T|. 

Regardless of the optimization technique, reward functions are, 
in most studies, evaluated with external tracking devices (table[l| 
last column). While this approach is useful when researchers 
aims at finding the most efficient controllers ( e.g. Kohl and Stone) 
p004) ; [Sproewitz et al.[ pOOS) ; [Hemker et al.[pod9^ ), learningal^ 
gorithms that target adaptation and resilience need to be robust 
to the inaccuracies and constraints of on-board measurements. 



2.3. Resilience based on self-modeling 



Instead of directly learning control parameters, Bongard et al. 
< |2006} propose to improve the resilience of robots by equipping 
robots with a self-model. If a disagreement is detected between 
the self-model and observations, the proposed algorithm first in- 
fers the damages by chosing motor actions and measuring their 
consequences on the behavior of the robot; the algorithm then 
relies on the updated model of the robot to learn a new behav- 
ior. This approach has been successfully tested on a starfish-like 
quadrupedal robot (Bon gard et"aD , '2006 ''Zy kov} [2008 ). By adapt- 
ing its self -model, the robot manages to discover a new walking 
gait after the loss of one of its legs. 
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Table 1: Typical examples of learning algorithms that have been used on legged robots. 



approach /article 


starting beh. * 


learning time 


robot 


DOFs^ 


param.* 


reward 


Policy Gradient Methods 
















Kimura et al.'^2001^ 


no info. 


80 min. 


quadruped 


8 


72 


internal 




Kohl and Stone ( 20d4| 


walking 


3h 


quadruped 


12 


12 


external 




Tedrake et al. ( 2005 ) 


standing 


20 min. 


bidepal 


2 


46 


internal 


Evolutionary Algorithm 
















Chernova and Veloso (2004) 


random 


5h 


quadruped 


12 


54 


external 




Zykov et al. (2004 ) ' 


random 


2h 


hexapod 


12 


72 


external 




Berenson et al. (2005) 


random 


2h 


quadruped 


8 


36 


external 




Hornby et al. (2005) 


non-falling 


25h 


quadruped 


19 


21 


internal 




Mahdavi and Bentley||2006 1 


random 


10 h 


snake 


12 


1152 


external 




Bartoot et al. (2006) 


random 


10 h 


hexapod 


12 


135 


external 




Yosinski et al.H2011) 


random 


2h 


quadruped 


9 


5 


external 


Others 
















Wein^arten et al. |2004[ ' 


walking 


> 15h 


hexapod (Rhex-like) 


6 


8 


external 




Sproewitzet al. (2008| 


random 


60 min. 


quadruped 


8 


5 


external 




Hemker et al. (2009 )"^ 


walking 


3-4 h 


biped 


24 


5 


external 




Farfoot et al. (2006 ) ^ 


random 


Ih 


hexapod 


12 


135 


external 



Behavior used to initialize the learning algorithra. 
^ DOFs: nuraber of controlled degrees of freedora. 
* parara: nuraber of learned control pararaeters. 

^ Nelder-Mead descent. ^ Powell raethod. ^ Design and Analysis of Coraputer Experiraents. Multi-agent reinforcement learning 



In Bongard's algorithm, the identification of the self-model is 
based on an active learning loop that is itself divided into an ac- 
tion selection loop and a model selection loop (Fig. 3|. The action se- 
lection loop aims at selecting the action that will best distinguish 
the models of a population of candidate models. The model se- 
lection loop looks for the models that best predict the outcomes 
of the actions as measured on the robot. In the ''starfish'' experi- 
ment ( [Bongard et al. 2006^ , the following steps are repeated: 

1.1. action selection (exploration): 

- each of the 36 possible actions is tested on each of the 16 
candidate models to observe the orientation of robot's 
body predicted by the model; 

- the action for which models of the population disagree 
at most is selected; 

- this action is tested on the robot and the correspond- 
ing exact orientation of robot's body is recorded by an 
external camera; 

1.2. model selection loop (estimation): 

- a stochastic optimization algorithm (an FA) is used to 
optimize the population of models so that they accu- 
rately predict what was measured with the robot, for 
each tested action; 

- if less than 15 actions have been performed, the action 
selection loop is started again. 

Once the 15 actions have been performed, the best model found 
so far is used to learn a new behavior using an FA: 
2. controller optimization (exploitation): 

- a stochastic optimization algorithm (an FA) is used to 
optimize a population controllers so that they maximize 
forward displacement within the simulation of the self- 
model; 

- the best controller found in the simulation is transferred 
to the robot, making it the new controller. 

The population of models is initialized with the self-model that 
corresponds to the morphology of the undamaged robot. Since 
the overall process only requires 15 tests on the robot, its speed 
essentially depends on the performance of the employed com- 
puter. Significant computing times are nonetheless required for 
the optimization of the population of models. 



In the results reported by 'Bongard et al. ( |2006[ , only half of 
the runs led to correct self-models. As Bongard's. approach im- 
plies identifying a full model of the robot, it would arguably re- 
quire many more tests to converge in most cases to the right mor- 
phology. For comparison, results obtained by the same authors 
but in a simulated experiment required from 600 to 1500 tests to 
consistently identify the model ( Bongard and Lipson 2005) . It 
should also be noted that these authors did not measure the ori- 
entation of robot's body with internal sensors, whereas noisy in- 
ternal measurements could significantly impair the identification 
of the model. Other authors experimented with self-modeling 
process similar to the one of Bongard et al., but with a humanoid 
robot l |Zagal et al. 2009}. Preliminary results suggest that thou- 



sands of evaluations on the robot would be necessary to correctly 
identify 8 parameters of the global self-model. Alternative meth- 
ods have been proposed to build self-models for robots and all 
of them require numerous tests, e.g. on a manipulator arm with 



about 400 real tests jSturm et al. 2008} or on a hexapod robot with 
about 240 real tests (Parker' "2009). Overall, experimental costs for 
building self-models appear expensive in the context of resilience 
applications in both the number of tests on the real robot and in 
computing time. 

Furthermore, controllers obtained by optimizing in a simula- 
tion - as does the algorithm proposed by Bongard et al. - often do 
no work as well on the real robot than in simulation jKoos et aT] 
2012; Zagal et al.} pOl) |Jakobi et al.}p95) . In effect, t his classic 
problem has been observed in the starfish experiments [Bongard 
|et al.| ( [2006 }. In these experiments, it probably originates from the 
fact that the identified self-model cannot perfectly model every 
detail of the real world (in particular, slippage, friction and very 
dynamic behaviors). 



2.4. Concluding thoughts 

Based on this short survey of the literature, two main thoughts 
can be drawn: 

1. Policy gradient methods and FAs can both be used to dis- 
cover original behaviors on a damaged robot; nevertheless, 
when they don't start from already good initial controllers, 
they require a high number of real tests (at least a few hun- 
dred), which limits the speed of the resulting resilience pro- 
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2. Methods based on self-modeling are promising because they 
transfer some of the learning time to a simulation; however 
building an accurate global model of the damaged robot re- 
quires many real tests; reality gap problems can also occur 
between the behavior learned with self-model and the real, 
damaged robot. 



3. The T-Resilience algorithm 
3.1. Concept and intuitions 

Following Bongard et al., we equip our robot with a self-model. A 
direct consequence is that detecting the occurrence of a damage is 
facilitated: if the observed performance is significantly different 
from what the self-model predicts, then the robot needs to start a 
recovery process to find a better behavior. Nevertheless, contrary 
to Bongard et al., we propose that a damaged robot discovers new 
original behaviors using the initial, hand-designed self-model, that is 
without updating the self-model. Since we do not attempt to di- 
agnose damages, the solved problem is potentially easier than the 
one solved by Bongard et al; we therefore expect our algorithm to 
perform faster. This speed increase can, however, comes at the 
price of slightly less efficient post-damage behaviors. 

The model of the undamaged robot is obviously not accurate 
because it does not model the damages. Nonetheless, since dam- 
ages can't radically change the overall morphology of the robot, 
this ''undamaged'' self-model can still be viewed as a reasonably 
accurate model of the damaged robot. Most of the degrees of free- 
dom are indeed correctly positionned, the mass of components 
should not change much and the body plan is most probably not 
radically altered. 

Imperfect simulators and models are an almost unavoidable 
issue when robotic controllers are first optimized in simulation 
then transferred to a real robot. The most affected field is prob- 
ably evolutionary robotics because of the emphasis on opening 
the search space as much as possible: behaviors found within the 
simulation are often not anticipated by the designer of the simu- 
lator, therefore it's not surprising that they are often wrongly sim- 
ulated. Researchers in evolutionary robotics explored three main 
ideas to cross this "reality gap": (1) automatically improving sim- 
ulators < |Bongard et al.} |2006{ [Pretorius et al.} |2012{ Klaus et aL 



2012| , (2) trying to prevent optimized controllers from relying on 



the unreliable parts of the simulation (in particular, by adding 
noise) ( Jakobi^ al.}p?95j, and (3) model the difference between 



simulation and reality { [Hartland and Bredeche 2006 Koos et al. 
[2012} . 

Translated to resilient robotics, the first idea is equivalent to 
improving or adapting the self-model, with the aforementioned 
shortcomings (sections and [23) . The second idea corresponds 
to encouraging the robustness of controllers so that they can deal 
with an imperfect simulation. It could lead to improvements in 
resilient robotics but it requires that the designer anticipates most 
of the potential damages. The third idea is more interesting for 
resilient robotics because it acknowledges that simulations are 
never perfect and mixes reality and simulation during the op- 
timization. Among the algorithms of this family, the recently- 
proposed transferability approach (K oos et al.} |2012) explicitly 
searches for high-performing controllers that work similarly in 
both simulation and reality. It led to successful solutions for 
quadruped robot (2 parameters to optimize) and for a Khepera- 
like robot in a T-maze (weights of a feed-forward neural networks 
to optimize) (Koos et al. . 2012). 

The main assumption of the transferability approach is that 
some transferable behaviors exist in the search space. Although 
formulated in the context of the reality gap, this assumption holds 
well in resilient robotics. For instance, if a hexapod robot breaks a 
leg, then gaits that do not critically rely on this leg should lead to 
similar trajectories in the self-model and on the damaged robot. 



Such gaits are numerous: those that make the simulated robot lift 
the broken leg so that it never hits the ground; those that make the 
robot walk on its "knees"; those that are robust to leg damages be- 
cause they are closer to crawling than walking. Similar ideas can 
be found for most robots and for most mechanical and electrical 
damages, provided that there are different ways to achieve the 
mission. For example, any redundant robotic manipulator with a 
blocked joint should be able to follow a less efficient but working 
trajectory that does not use this joint. 

The transferability approach captures the differences be- 
tween the self-model and reality through the transferability func- 
tion ( Mouret et al.}|2012t : 

Definition 1 (transferability function) A transferability function 

is a function that maps, for the whole search space, descriptors of solu- 
tions (e.g. control parameters, or behavior descriptors) to a transfer- 
ability score that represents how well the simulation matches the real- 
ity. 

This function is usually not accessible but it can be learned 
with a regression algorithm (neural networks, support vector ma- 
chines, etc.) by recording the behavior of a few controllers in re- 
ality and in simulation. 

3.2. T-Resilience 

To cross the reality gap, the transferability approach essentially 
proposes optimizing both the approximated transferability and 
the performance of controllers with a stochastic multi-objective 
optimization algorithm. This approach can be adapted to make 
a robot resilient by seeing the original, "un-damaged" self-model 
as an inaccurate simulation of the damaged robot, and if the robot 
only uses internal measurements to evaluate the discrepancies 
between predictions of the self-model and measures on the real 
robot. Resilient robotics is thus a related, yet new application of 
the transferability concept. We call this new approach to resilient 
robotics "T-Resilience" (for Transferability-based Resilience). 

Algorithm. The T-Resilience algorithm relies on three main 
principles (Fig. [1] and Algorithm [T|: 

• the self-model of the robot is not updated; 

• the approximated transferability function is learned "on the 
fly" thanks to a few periodic tests conducted on the robot 
and a regression algorithm; 

• three objectives are optimized simultaneously: 

{^jelf{c) 
f{hself{c)) 
diver sity{c) 

where Tsei / (c) denotes the performance of the candidate solu- 
tion c that is predicted by the self-model (e.g. the forward dis- 
placement in the simulation); hseif{c) denotes the behavior de- 
scriptor of c, extracted by recording the behavior of c in the 
self-model; T{hseif{c)) denotes the approximated transferability 
function between the self-model and the damaged robot, which is 
separately learned using a regression algorithm; and diver sity{c) 
is a application-dependent helper-objective that helps the opti- 
mization algorithm to mitigate premature convergence ( (Toffolo 
jand Benini1 < |2003t ; pouret and DoncieuxH2012t ). 

Evaluating these three objectives for a particular controller 
does not require any real test: the behavior of each controller and 
the corresponding performance are predicted by the self-model; 
the approximated transferability value is computed thanks to the 
regression model of the transferability function. The update of the 
approximated transferability function is therefore the only step of the al- 
gorithm that requires a real test on the robot. Since this update is only 
performed every N iterations of the optimization algorithm, only 
a handful of tests on the real robot have to be done. 
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Figure 4: Schematic view of the T-Resilience algorithm (see algorithm [T] for an algorithmic view). (A) Discovery loop: each controller 
of the population is evaluated with the self-model. Its transferability score is approximated according to the current model 
T of the exact transferability function T. (B) Transferability update: every N iterations, a controller of the population is 
randomly selected and transferred onto the real robot. The model of the transferability function is next updated with the 
data generated during the transfer. 



At a given iteration, the T-Resilience algorithm does not need 
to predict the transferability of the whole search space, it only 
needs these values for the candidate solutions of the current pop- 
ulation. Since the population, on average, moves towards better 
solutions, the algorithm has to periodically update the approxi- 
mation of the transferability function. To make this update sim- 
ple and unbiased, we chose to select the solution to be tested on 
the robot by picking a random individual from the population. 
We experimented with other selection schemes in preliminary ex- 
periments, but we did not observe any significant improvement. 

Three choices depend on the application: 

• the performance measure Tseif (i.e. the reward function); 

• the diversity measure; 

• the regression technique used to learn the transferability 
function and, in particular, the inputs and outputs of this 
function. 

We will discuss and describe each of these choices for our re- 
silient hexapod robot in section|4] 

Optimization algorithm. Recent research in stochastic opti- 
mization proposed numerous algorithms to simultaneously op- 
timize several objectives (Deb^ 2001 ); most of them are based on 
the concept of Pareto dominance, defined as follows: 

Definition 2 (Pareto dominance) A solution p* is said to dominate 
another solution p, if both conditions 1 and 2 are true: 

1. the solution p* is not worse than p with respect to all objectives; 

2. the solution p* is strictly better than p with respect to at least one 
objective. 

The non-dominated set of the entire feasible search space is the 
globally Pareto-optimal set (Pareto front). It represents the set of 
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Figure 5: Choice of the final solution at the end of the T-Resilience 
algorithm. 



optimal trade-offs, that is solutions that cannot be improved with 
respect to one objective without decreasing their score with re- 
spect to another one. 

Pareto-based multi-objective optimization algorithms aim at 
finding the best approximation of the Pareto front, both in terms 
of distance to the Pareto front and of uniformity of its sampling. 
This Pareto front is found using only one execution of the algo- 
rithm and the choice of the final solution is left to another algo- 
rithm (or to the researcher). Whereas classic approaches to multi- 
objective optimization aggregate objectives (e.g. with a weighted 
sum) then use a single-objective optimization algorithm, multi- 
objective optimization algorithms do not require tuning the rela- 
tive importance of each objective. 

Current stochastic algorithms for multi-objective optimization 
are mostly based on EAs, leading to Multi-Objective Evolutionary 
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Algorithm 1 T-Resilience (T real tests) 



- 1 - 



, . . . , c*^ } (randomly generated) 



pop ^ 

f or i = 1 ^ T do 

random selection of c* in pop 
transfer of c* on the robot 

estimation of performance Treai (c*) using internal measurements 
estimation of the exact transferability value \^Tsei / (c*) — Treai (c*) | 
update of the approximated transferability function T 
N iterations of MOEA on pop 
end for 
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Figure 6: (a) The 18-DOF hexapod robot is equipped with a RGB-D camera (RGB camera with a depth sensor), (b) Snapshot of the 
simulation used as a self-model by the robot which occurs in an ODE-based physics simulator. The robot lies on a horizontal 
plane and contacts are simulated as well, (c) Kinematic scheme of the robot, (d) Control function 7(t, a, 0) with a — 1 and 
= 0. 



Al gorithms (MOEA) Like most EAs, they are intrinsically paral- to a RGB-D SLAM algorithm | En dres et al.} [2012^1 from the ROS 
lei ^Cantu-Paz 2000^ ^ making them especially efficient on mod- framework |Quigley et al. 2009p 
ern multi-core computers, GPUs and clusters < |Mouret and Don^ ~" " " 



2010). In the T-Resilience algorithm, we rely on NSGA- 
II peb et aT] |200 2 : Deb , 2001 ), one of the most widely used 
multi-objective optimization algorithm (appendix |B|; however, 
any Pareto-based multi-objective algorithm can replace this spe- 
cific EA in the T-Resilience algorithm. 

At the end of the optimization algorithm, the MOEA discards 
diversity values and returns a set of non-dominated solutions 
based on performance and transferability. We then need to choose 
the final controller. Let us define the ''transferable non-dominated 
set'' as the set of non-dominated solutions whose transferabil- 
ity values are greater than a user-defined threshold. To deter- 
mine the best solution of a run, the solution of the transferable 
non-dominated set with the highest performance in simulation 
is transferred onto the robot and its performance in reality is as- 
sessed. The final solution of the run is the controller that leads to 
the highest performance on the robot among all the transferred 
controllers (Fig.|5]). 



4. Experimental validation 

4.1. Robot and parametrized controller 

The robot is a hexapod with 18 Degrees of Freedom (DOE), 3 for 
each leg (Fig.[6|a,c)). Each DOE is actuated by position-controlled 
servos (6 AX-12 and 12 MX-28 Dynamixel actuators, designed by 
Robotis). The first servo controls the horizontal orientation of the 
leg and the two others control its elevation. The kinematic scheme 
of the robot is pictured on Figure[6]c. 

A RGB-D camera (Asus Xtion) is screwed on top of the robot. It 
is used to estimate the forward displacement of the robot thanks 



The movement of each DOE is governed by a periodic func- 
tion that computes its angular position as a function 7 of time t, 
amplitude a and phase (j) (Fig. [6] d): 



7(t, a, (j)) — a • tanh (4 • sin (2 • tt • (t + cj)))) 



(1) 



where a and are the parameters that define the amplitude of 
the movement and the phase shift of 7, respectively. Frequency is 
fixed. 

Angular positions are sent to the servos every 30 ms. The main 
feature of this particular function is that, thanks to the tanh func- 
tion, the control signal is constant during a large part of each cy- 
cle, thus allowing the robot to stabilize itself. In order to keep the 
"tibia" of each leg vertical, the same control signal is used for the 
two last servos. Consequently, positions sent to the i*^ servos are: 

• 7(t, a\, 01) for DOE 1; 

• 7(t, ai, 0^2) for DOES 2 and 3. 

There are 4 parameters for each leg {a\, a\, 02 ), therefore 
each controller is fully described by 24 parameters. By varying 
these 24 parameters, numerous gaits are possible, from purely 
quadruped gaits to classic tripod gaits. 

This controller is designed to be as simple as possible so that 
we can show the performance of the T-Resilience algorithm in a 
straightforward setup. Nevertheless, the T-Resilience algorithm 
does not put any constraint on the type of controllers and many 
other controllers are conceivable (e.g. bio-inspired central pattern 
genera tors likejSproewitz et al. <|2008| or evolved neural networks 
like in | |Yosinski et al.||2011{|Clune et al.||2011> ). 



^We downloaded our implementation from: Jhttp : / /www . ros . org/wiki/| 

rqbdslam 

' ^littp : / / www . ros . org 



Koos, Cully and Mouret 



arXiv | 7 



4.2. Reference controller 



A classic tr ipod gait ^W ilson, 1966; 'Sara nli et aT) | 2001': 'Sc hraitz 
[et al, 2001} [Ding etaL{ ,2010; Steingrube et al.||2010) is used as a 
reference point. This reference gait considers two tripods: legs 0, 
2, 4 and legs 1, 3, 5 (see Figure [6] for numbering). It is designed to 
always keep the robot balanced on at least one of these tripods. 
The walking gait is achieved by lifting one tripod, while the other 
pushes the robot forward (by shifting itself backward). The lifted 
tripod is then placed forward in order to repeat the cycle by in- 
verting the tripods. This gait is static, fast and similar to insect 
gaits ^ Wilson} [1966) pelcomynj |1971} . The parameters of this ref- 
erence controller are available in appendix [C| 

4.3. Implementation choices for T-Resilience 

Performance function. The mission of our hexapod robot is to 
go forward as fast as possible, regardless of its current state and 
of any sustained damages. The performance function to be opti- 
mized is the forward displacement of the robot predicted by its 
self-model. Such a high-level function does not constrain the fea- 
tures of the optimized behaviors, so that the search remains as 



f(hself(c)) 



(1) 



5 ^t^O^ 



(5) 



Open as possible, possibly leading to original gaits ^Nelson et al. 
[2009^ : 



-E,SELF^^^_^t^^O,SELF^^^ 



(2) 



where p^T^'^^^^ (c) denotes the x-position of the robot's center at 
the beginning of the simulation when the parameters c are used 
and p^^^'^^^^ (c) its x-position the end of the simulation. 

Because each trial lasts only a few seconds, this performance 
function does not strongly penalize gaits that do not lead to 
straight trajectories. Using longer experiments would penalize 
these trajectories more, but it would increase the total experimen- 
tal time too much to perform comparisons between approaches. 
Other performance functions are possible and will be tested in 
future work. 

Diversity function. The diversity score of each individual is the 
average Euclidean distance to all the other candidate solutions of 
the current population. Such a parameter-based diversity objec- 
tive enhances the exploration of the control space by the popula- 
tion jToffolo and Benini} [2003} [Mouret and Doncieux| |2012t and 
allows the algorithm to avoid many local optima. This diversity 
objective is straightforward to implement and does not depend 
on the task. 



where E is the number of time-steps of the control function (equa- 
tion [l} and: 



if leg n touches the ground at that time-step 



otherwise 



(6) 



We chose to describe gaits using contact^ because it is a classic 
representation of robotic and animal gaits (e.g. |Delcom3m ^1971^ ). 
On the real robots, we deduces the contacts by measuring the 
torque applied by each servo. 

We chose SVMs to approximate the transferability score be- 
cause of the high number of inputs of the model and because 
there are many available implementations. Contrary to other 
classic regression models (neural networks, Kriging, ...), SVMs 
are indeed not critically dependent on the size of the input 
space ( Smola and Va pnik}[T997 ; Smola and Scholkopf . 2004). They 
also provide fast learning and fast prediction when large input 
spaces are used. 

Self-model. The self-model of the robot is a dynamic simula- 
tion of the undamaged six-legged robot in Open Dynamics En- 
gine (ODEj^on a flat ground (Fig.jiJ?). 

Main parameters. For each experiment, a population of 100 
controllers is optimized for 1000 generations. Every 40 gener- 
ations, a controller is randomly selected in the population and 
transferred on the robot, that is we use 25 real tests on the robot 
in a run. Each test takes place as follows: 

• the selected controller is transferred and evaluated for 3 sec- 
onds on the robot while the RGB-D camera records both 
color and depth images at 10 Hz; 

• a SLAM algorithm estimates the forward displacement of the 
robot based on the data of the camera; 

• the estimate of the forward displacement is provided to the 
main algorithm. 

At each generation, each parameter of each selected candidate 
solution has a 10% chance of being incremented or decremented, 
with both options equally likely; five values are available for each 
if (0, 0.25, 0.5, 0.75, 1) and for each a (0, 0.25, 0.5, 0.75, 1). 

To select the final solution, we fixed the transferability thresh- 
old at 0.1 meter. 



diversity{c) = ^ 



(3) 



where Pn is the population at generation n, N the size of P and 
Cj the j^^ parameter of the candidate solution c. Other diversity 
measures (e.g. behavioral measures, like in |Mouret and Don-^ 
cieux 2012} ) led to similar results in preliminary experiments. 



Regression model. When a controller c is tested on the real 
robot, the corresponding exact transferability score T is com- 
puted as the absolute difference between the forward perfor- 
mance predicted by the self-model and the performance esti- 
mated on the robot based on the SLAM algorithm. 



T(c) 



SELF f N 
\Pt^E (C) 



REAL/ N 



(4) 



The transferability function is approximated by training a SVM 
model T using the zy-Support Vector Regression algorithm with 
linear kernels implemented in the library h'^sz;;tp|( C hang and Lin] 
[2011 1| (learning parameters are set to default values). 



^http : / /www . csie . ntu . edu . tw/ ~ c j lin/libsvm| 



4.4. Test cases and compared algorithms 

To assess the ability of T-Resilience to cope with many different 
failures, we consider the six following test cases (Fig.|7|: 

• A. the hexapod robot is not damaged; 

• B. the left middle leg is no longer powered; 

• C. the terminal part of the front right leg is shortened by half; 

• D. the right hind leg is lost; 

• E. the middle right leg is lost; 

• F. both the middle right leg and the front left leg are lost. 

We compare the The T-Resilience algorithm to three represen- 
tative algorithms from the literature (see appendix[D|for the exact 
implementations of each algorithm): 



^When choosing the input of a predictor, there is a large difference betwe en using 
the control pararaeters and using high-level descriptors of the behavior jMouret| 
[and Doricie ux 2012 1. Intuitively, most humans can predict that a behavior will 
work on a real robot by watching a simulation, but their task is much harder if 
they can only see the parameters. More technically, predicting features of a com- 
plex dynamical system usually requires simulating it. By starting with the output 
of a simulator, the predictor avoids the need to re-invent physical simulation and 
can focus on discrimination. 



Open Dynamics Engine: http : / /www . ode . org 
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Figure 7: Test cases considered in our experiments. (A) The hexa- 
pod robot is not damaged. (B) The left middle leg is no 
longer powered. (C) The terminal part of the front right 
leg is shortened by half. (D) The right hind leg is lost. (E) 
The middle right leg is lost. (F) Both the middle right leg 
and the front left leg are lost. 



Test cases 


A B C D E F 


Perf. (CODA) 
Perf. (SLAM) 


0.78 0.26 0.25 0.00 0.15 0.10 
0.75 0.17 0.26 0.00 0.04 0.16 



Table 2: Performances in meters obtained on the robot with the 
reference gait in all the considered test cases. Each test 
lasts 3 seconds. The CODA line corresponds to the dis- 
tance covered by the robot according to the external mo- 
tion capture system. The SLAM line corresponds to 
the performance of the same behaviors but reported by 
the SLAM algorithm. When internal measures are used 
(SLAM line), the robot can easily detects that a damage 
occurred because the difference in performance is very 
significant (column A versus the other columns). 

Final performance values are recorded with a CODA cxl mo- 
tion capture system (Charnwood Dynamics Ltd, UK) so that re- 
ported results do not depend on inaccuracies of the internal mea- 
surements. However, all the tested algorithms have only access 
to the internal measurements. 



• a stochastic local search jHoos and Stutzle)|2005^ , because of 
its simplicity; 

• a policy gradient method inspired from |Kohl and Stone | 
< |2004 |, because this algorithm has been successfully applied 
to learn quadruped locomotion; 

• a self-modeling process inspired from Bongard et al. ^2006} . 
To make the comparisons as fair as possible, we designed our 

experiments to compare algorithms after the same amount of run- 
ning time or after the same number of real tests (see appendix 
|E] for their median durations and their median numbers of real 
tests). In all the test cases, the T-Resilience algorithm required 
about 19 minutes and 25 tests on the robot (1000 generations of 
100 individuals). Consequently, two key values are recorded for 
each algorithm (see Appendix [D| for exact procedures): 

• the performance of the best controller obtained after about 
25 real test£] 

• the performance of the best controller obtained after about 
19 minutes. 

The experiments for the four first cases (A, B, C and D) showed 
that only the stochastic local search is competitive with the T- 
Resilience. We therefore chose to only compare T-Resilience with 
the local search algorithm for the two last failures (E and F). 

Preliminary experiments with each algorithm showed that ini- 
tializing them with the parameters of the reference controller did 
not improve their performance. We interpret these preliminary 
experiments as indicating that the robot needs to use a qualita- 
tively different gait, which requires substantial changes in the pa- 
rameters. This observation is consistent with the gaits we tried to 
design for the damaged robot. As a consequence, we chose to ini- 
tialize each of the compared algorithms with random parameters 
instead of initializing them with the parameters of the reference 
controller. By thus starting with random parameters, we do not 
rely on any a priori about the gaits for the damaged robot: we 
start with the assumption that anything could have happened. 

We replicate each experiment 5 times to obtain statistics. Over- 
all, this comparison requires the evaluation of about 4000 differ- 
ent controllers on the real robot. 

We use 4 Intel(R) Xeon(R) CPU E31230 @ 3.20GHz, each of 
them including 4 cores. Each algorithm is programmed in the 
Sferes^2 framework (M ouret and Doncieux}|2010f and the source- 
code is available as extension 10. The MOEA used in Bongard's 
algorithm and in the T-Resilience algorithm is distributed on 16 
cores using MPI. 

''Depending on the algorithm, it is sometimes impossible to perform exactly 25 
tests (for instance, if two tests are performed for each iteration). 



5. Results 

5.1. Reference controller 

Table |2] reports the performances of the reference controller for 
each tested failure, measured with both the CODA scanner and 
the on-board SLAM algorithm. At best, the damaged robot cov- 
ered 35% of the distance covered by the undamaged robot (0.78 m 
with the undamaged robot, at best 0.26 m after a failure). In cases 
B, C and E, the robot also performs about a quarter turn (Figure |8] 
(a), (b) and (e)); in case D, it falls over; in case F, it alternates for- 
ward locomotion and backward locomotion (figure [8](f)). Videos 
of these behaviors are available in appendix [A| 

This performance loss of the reference controller clearly shows 
that an adaptation algorithm is required to allow the robot to pur- 
sue its mission. Although not perfect, the distances reported by 
the on-board RGB-D SLAM are sufficiently accurate to easily de- 
tect when the adaptation algorithm must be launched. 

5.2. Comparison of performances 

Fig.[9]shows the performance obtained for all test cases and all the 
investigated algorithms. Table |3] reports the improvements be- 
tween median performance values. P-values are computed with 
the Wilcoxon rank-sum tests (appendix [FJ. The horizontal lines 
in Figure [9]show the efficiency of the reference gait in each case. 

The trajectories corresponding to controllers with median per- 
formance values obtained with the T-Resilience are depicted on 
figure [S] Videos of the typical behaviors obtained with the T- 
Resilience on every test case are available in extension (Exten- 
sions 1 to 9). 

Performance with the undamaged robot (case A). When the 
robot is not damaged, the T-Resilience algorithm discovered con- 
trollers with the same level of performance than the reference 
hexapod gait (p-value = 1). The obtained controllers are from 2.5 
to 19 times more efficient than controllers obtained with other al- 
gorithms (Table |3|. 

The poor performance of the other algorithms may appear sur- 
prising at first sight. Local search is mostly impaired by the very 
low number of tests that are allowed on the robot, as suggested 
by the better performance of the ''time'' variant (20 minutes / 
50 tests) versus the "tests" variant (10 minutes / 25 tests). Sur- 
prisingly, we did not observe any significant difference when we 
initialized the control parameters with those of the reference con- 
troller (data not shown). The policy gradient method suffers even 
more than local search from the low number of tests because a lot 
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(f) Middle right leg and front left leg lost (case F). 



Figure 8: Typical trajectories (median performance) observed in every test case. Dashed line: reference gait. Solid line: controller with 
median performance value found by the T-Resilience algorithm. The poor performance of the reference controllers after 
any of the damages shows that adaptation is required in these situations. The trajectories obtained with the T-Resilience 
algorithm are not perfectly straight because our objective function does not explicitly reward straightness (see sections 4.3 
andl53l. 
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Figure 9: Performances obtained in each test cases (distance covered in 3 seconds). On each box, the central mark is the median, 
the edges of the box are the lower hinge (defined as the 25th percentile) and the upper hinge (the 75th percentile). The 
whiskers extend to the most extreme data point which is no more than 1.5 times the length of the box away from the 
box. Each algorithm has been run 5 times and distances are measured using the external motion capture system. Except 
for the T-Resilience, the performance of the controllers found after about 25 transfers (tests) and after about 20 minutes 
(time) are depicted (all T-Resilience experiments last about 20 minutes and use 25 transfers). The horizontal lines denote 
the performances of the reference gait, according to the CODA scanner (dashed line) and according to the SLAM algorithm 
(solid line). 



Koos, Cully and Mouret 



arXiv | 11 





Local search 
tests time 


Policy search 
tests time 


Self-modeling 
time tests 


reference gait 


A 


4.5 


2.5 


3.3 


4.5 


6.3 


19.0 


1.0 


B 


2.3 


2.2 


2.3 


2.3 


+++ 


3.2 


2.3 


C 


2.8 


1.7 


2.8 


2.8 


+++ 


+++ 


1.8 


D 


1.4 


1.1 


2.1 


3.0 


+++ 


+++ 


+++ 


E 


3.4 


2.8 










4.3 


F 


1.7 


1.3 










4.5 


global median 


2.8 


2.0 


2.6 


2.9 


+++ 


+++ 


3.3 



(a) Ratios between median performance values. 





Local search 
tests time 


Policy search 
tests time 


Self-modeling 
time tests 


reference gait 


A 


+59 


+45 


+53 


+59 


+64 


+72 


-2 


B 


+34 


+33 


+34 


+34 


+63 


+42 


+35 


C 


+29 


+18 


+29 


+29 


+63 


+74 


+20 


D 


+ 9 


+ 2 


+16 


+20 


+38 


+30 


+30 


E 


+46 


+42 










+50 


F 


+18 


+10 










+35 


global median 


+32 


+26 


+32 


+32 


+63 


+57 


+33 



(b) Differences between median performance values (cm). 



Table 3: Performance improvements of the T-Resilience compared to other algorithms. For ratios, the symbol +++ indicates that the 
compared algorithm led to a negative or null median value. 



of tests are required to estimate the gradient. As a consequence, 
we were able to perform only 2 to 4 iterations of the algorithm. 
Overall, these results are consistent with those of the literature 
because previous experiments used longer experiments and often 
simpler systems. Similar observations have been reported previ- 
ously by other authors ^Yosinski et al. 2011) . 

Bongard's algorithm mostly fails because of the reality gap be- 
tween the self-model and the real robot. Optimizing the behavior 
only in simulation leads - as expected - to controllers that per- 
form well with the self-model but that do not work on the real 
robot. This performance loss is sometimes high because the con- 
trollers make the robot fall of over or go backward. 

Resilience performance (cases B to F). When the robot is 
damaged, gaits found with the T-Resilience algorithm are always 
faster than the reference gait (p — 0.0625, one-sample Wilcoxon 
signed rank test). 

After the same number of tests (variant tests of each algo- 
rithm), gaits obtained with T-Resilience are at least 1.4 times faster 
than those obtained with the other algorithms (median of 3.0 
times) with median performance values from 30 to 65 cm in 3 
seconds. These improvements are all stastically significant (p 
0.016) except for the local search in the case D (loss of a hind leg; 
p = 0.1508). 

After the same running time (variant time of each algorithm), 
gaits obtained with T-Resilience are also significantly faster (at 
least 1.3 times; median of 2.8 times; p i= 0.016) than those ob- 
tained with the other algorithms in cases B, E and F. In cases C 
(shortened leg) and D (loss of a hind leg), T-Resilience is not sta- 
tistically different from local search (shortened leg: p = 0.1508; 
loss of a hind leg: p = 0.5476). Nevertheless, these high p-values 
may stem from the low number of replications (only 5 replica- 
tions for each algorithm). Moreover, as section 5.3 will show, the 
execution time of the T-Resilience can be compressed because a 
large part of the running time is spent in computer simulations. 
Consequently, depending on the hardware, better performances 
could be achieved in smaller amounts of time. 

For all the tested cases, Bongard's self-modeling algorithm 
doesn't find any working controllers. We observed that it suf- 
fers from two difficulties: the optimized models do not always 
capture the actual morphology of the robot, and reality gaps be- 
tween the self-model and the reality (see the comments about the 



undamaged robot). In the first case, more time and more actions 
could improve the result. In the second time, a better simulation 
model could make things better but it is unlikely to fully remove 
the effect of the reality gap. 

Loss of a leg (case D and E). When the hind leg is lost, the 
T-Resilience leads controllers that perform much better than the 
reference controller. Nevertheless, the performances of the con- 
troller obtained with the T-Resilience are not statistically differ- 
ent from those obtained with the local search. These results stems 
from the fact that many of the transfers made the robot tilt down 
(about half of the 25 transfers), whereas these tests are ignored 
in our current implementation. This make the estimation of the 
transferability function especially difficult. Fast six-legged behav- 
iors optimized on the self-model of the undamaged robot are in- 
deed often unstable without any of the hind legs. The process 
has to use several transfers to find a transferability function that 
allows the avoidance of all the gaits that use the hind leg. More- 
over, in this scenario, performances predicted by the self-model 
and transferability values are clearly antagonistic. Since trans- 
ferred solutions are randomly chosen in the population and the 
multi-objective optimization algorithm maintains a large spec- 
trum of Pareto-optimal trade-offs, transferred controllers have 
a high chance to have a high performance score but a low and 
mostly uninformative transferability score. 

If the robot loses a less critical leg (middle leg in case E), the 
T-Resilience is able to find fast gaits (about 3 times faster than 
with the local search). This failure does not critically impair the 
stability of the robot and the algorithm can conduct informative 
real tests on the robot. 



Straightness of the trajectories. For all the damages, the gaits 
found by T-Resilience do not result in trajectories that are per- 
fectly aligned with the x-axis (Fig.|8j deviations from 10 to 20 cm). 
The deviations mainly stem from the choice of the performance 
function and they should not be overinterpreted as a weakness 
of the T-Resilience algorithm. Indeed, the performance function 
only rewards the covered distance during 3 seconds and nothing 
is explicitly rewarding the straightness of the trajectory. At first, 
it seems intuitive that the fastest trajectories will necessarily be 
straight. However, the fastest gaits achievable with this specific 
robot are unknown and faster gaits may be achievable if the robot 
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Figure 10: Distribution of duration and experimental time for each algorithm (median values on 5 runs of test cases A, B, C, D). All 
the differences between experimental times are statistically significant (p-values < 2.5 x 10~^ with Wilcoxon rank-sum 
tests). 



is not pointing in the arbitrarily-chosen x-direction. For instance, 
the trajectory found with the undamaged robot (Fig. |8ja)) devi- 
ates from the x-axis, but it has the same final performance as the 
reference gait, which is perfectly straight. These two gaits can- 
not be distinguished by the performance function and we have 
no way to know if both faster and straighter trajectories are pos- 
sible in our system. The intuition that the straightest trajectories 
should be the fastest is even more challenged for the damaged 
robots because the robots are not symmetric anymore. In future 
work, we will investigate alternative performance functions that 
encourage straight trajectories. 

Moreover, trajectories are actually mostly straight (Fig. |8] and 
the videos in appendix)) but they do not exactly point to the x- 
direction. This direction seems to be mainly determined by the 
position of the robot at the beginning of the experiment: at t = 0, 
each degree of freedom is positioned according to the value of the 
control function, that is, the robot often starts in the middle of the 
gait pattern; because this position is non-symmetric and some- 
times unstable, we often observed that the first step often makes 
the robot point in a different direction. Once the gait is started, 
deviations along one directions are compensated by symmetrical 
deviations at the next step and the gaits are mostly straight. 

5.3. Comparison of durations and experimental time 

The running time of each algorithm is divided into experimen- 
tal time (actual experiments on the robot), sensor processing time 
(computing the robot's trajectory using RGB-D slam) and opti- 
mization time (generating new potential solutions to test on the 
robot). The median proportion of time allocated to each of this 



IC 



part of the algorithms is pictured for each algorithm on figure 

The durations of the SLAM algorithm and of the optimization 
processes both only depend on the hardware specifications and 
can therefore be substantially reduced by using faster computers 
or by parallelizing computation. Only experimental time can not 
easily be reduced. The median proportion of experimental time is 
29% for the T-Resilience, whereas both the policy search and the 
local search leads to median proportions higher than 40% for a 
similar median duration by run (about 20 minutes). The propor- 
tion of experimental time for the self-modeling process is much 
lower (median value equals to 1%) because it requires much more 
time for each run (about 250 minutes for each run, in our experi- 
ments). 

^Only test cases A, B, C and D are considered to corapute these proportions (5 runs 
for each algorithm) because the policy search and the self-raodeling process are 
not tested in test cases E and F 



The median experimental time of T-Resilience (6.3 minutes) is 
significantly lower than those of local search and of policy search 
(resp. 8.5 and 10.8 minutes, p-values of Wilcoxon rank-sum tests 
< 2.5 X 10~^). With the expected increases of computational 
power, this difference will increase each year. The self-modeling 
process requires significantly lower experimental time (median at 
3.6 minutes, p-values of Wilcoxon rank-sum- tests < 1.5 x 10~^^) 
because it only test actions that involve a single leg, which is 
faster than testing a full gait (3 second). 

6. Conclusion and discussion 

All our experiments show that T-Resilience is a fast and efficient 
learning approach to discover new behaviors after mechanical 
and electrical damages (less than 20 minutes with only 6 minutes 
of irreducible experimental time). Most of the time, T-Resilience 
leads to gaits that are several times better than those obtained 
with direct policy search, local search and Bongard's algorithm; T- 
Resilience never obtained worse results. Overall, T-Resilience ap- 
pears to be a versatile algorithm for damage recovery, as demon- 
strated by the successful experiments with many different types 
of damages. These results validate the combination of the princi- 
ples that underly our algorithm: (1) using a self-model to trans- 
form experimental time with the robot into computational time 
inside a simulation (2) learning a transferability function that pre- 
dicts performance differences between reality and the self-model 
(instead of learning a new self-model) and, (3) optimizing both 
the transferability and performance to learn behaviors in simu- 
lation that will work well on the real robot, even if the robot is 
damaged. These principles can be implemented with alternative 
learning algorithms and alternative regression models. Future 
work will identify whether better performance can be achieved 
by applying the same principles with other machine learning 
techniques. 

During our experiments, we observed that the T-Resilience al- 
gorithm was less sensitive to the quality of the SLAM than the 
other investigated learning algorithms (policy gradient and local 
search). Our preliminary analysis shows that the sensitivity of 
these classic learning algorithms mostly stems from the fact they 
optimize the SLAM measurements and not the real performance. 
For instance, in several of our experiments, the local search algo- 
rithm found gaits that make the SLAM algorithm greatly over- 
estimate the forward displacement of the robot. The T-Resilience 
algorithm relies only on internal sensors as well. However, these 
measures are not used to estimate the performance but to com- 
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pute the transferability values. Gaits that lead to over-estimations 
of the covered distance have low transferability scores because 
the measurement greatly differs from the value predicted by the 
self -model. As a consequence, they are avoided like all the behav- 
iors for which the prediction of the self -model does not match the 
measurement. In other words, the self-model acts as a ''credibility 
check'' of the SLAM measurements, which makes T-Resilience es- 
pecially robust to sensor inaccuracies. If sensors were redundant, 
this credibility check could also be used by the robot to continue 
its mission when a sensor is unavailable. 

The T-Resilience algorithm is designed to perform well with an 
inaccurate self-model, but the better the self-model is, the more 
efficient the algorithm is. For instance, in the case of the lost hind 
leg, the stability properties of the robot were too different from 
those of the self-model and efficient behaviors could hardly be 
found in a limited amount of time. If the situation of the robot 
implies that such a strong failure can not be repaired, then the 
T-Resilience would benefit from an update of the self-model. In 
some extreme cases, the T-Resilience algorithm could also be un- 
able to find any satisfying controller. To identify a new self model, 
the robot could launch Bongard's algorithm and then use the up- 
dated model with the T-Resilience algorithm - to avoid reality 
gaps between the model and the real robot. It seems also pos- 
sible to take advantage of the data recorded by the internal sen- 
sors during the T-Resilience transfers to design a new self-model 
that would minimize the transferability score. The combination 
of T-Resilience and update of the self-model will be addressed in 
future work. 

Overall, learning to predict what behaviors should be avoided 
is a very general concept that can be applied to many situations 
in which a robot has to autonomously adapt its behavior. On a 
higher level, this concept could also share some similarities with 
what human do when they are injured: if a movement is painful, 
humans do not fully understand what cause the pain, but they 
identify the behaviors that cause the pairj^ once they know that 
some move are painful, they learn to instinctively avoid them. 
Humans seem reluctant to permanently change their self-model 
to reflect what behaviors are possible: people with an immobi- 
lized leg still know how to activate their muscles, amputated peo- 
ple frequently report pain in their missing members < |Ramachan- 
|dran and Hirsteuil 1998^ a nd dream about themselves in their 
intact body (Mulder et al. 12008). Humans may therefore learn 
by combining their self-model with a second model that predicts 
which behaviors should be avoided, even if they are possible. 
This model would be similar in essence to a transferability func- 
tion. 
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A. Index to Multimedia Extensions 

See tablelH 



B. NSGA-II 

Figure pl| descri bes the main principle of the NSGA-II algo- 
rithm iDebet al.||2'002) . 



Koos, Cully and Mouret 



arXiv | 16 



Extension Media type Description 



1 


Video 


2 


Video 


3 


Video 


4 


Video 


5 


Video 


6 


Video 


7 


Video 


8 


Video 


9 


Video 


10 


Code 



The median and best gaits obtained with T-Resilience in case A 
The median and best gaits obtained with T-Resilience in case B 
The median and best gaits obtained with T-Resilience in case C 
The median and best gaits obtained with T-Resilience in case D 
The median and best gaits obtained with T-Resilience in case E 
The median and best gaits obtained with T-Resilience in case F 
A full T-Resilience run (25 tests, accelerated 2.5 times) in case B 
A full T-Resilience run (25 tests, accelerated 2.5 times) in case F 
Behavior of the reference controller in each of the six test cases 
Source code (C++) for all the experiments 



(undamaged robot) 
(unpowered leg) 
(shortened leg) 
(Hind leg lost) 
(middle leg lost) 
(middle and hind legs lost) 
(unpowered leg) 
(both middle and hind legs lost) 



Table 4: Multimedia extensions are available online at: jhttp : //chronos . isir . upmc . f r / -mouret /t_resilience| 

Pareto ranking 



copy 




selection + mutation 





Performance 



new generation 



Figure 11: The stochastic multi-objective optimization algorithm NSGA-II <Deb et al. 2002 1. Starting with a population of N ran- 
domly generated individuals, an offspring population of N new candidate solutions is generated using the best candidate 
solutions of the current population. The union of the offspring and the current population is then ranked according to 
Pareto dominance (here represented by having solutions in different ranks connected by lines labeled Li, L2, etc.) and the 
best N candidate solutions form the next generation. 



C. Reference controller 

Table |5] shows the parameters used for the reference controller 
(section [41) . The amplitude orientation parameters {a\) are set 
to 1 to produce a gait as fast as possible, while the amplitude 
elevation parameters (0^2) are set to a small value (0.25) to keep 
the gait stable. The phase elevation parameters (02 ) define two 
tripods: 0.25 for legs 0-2-4; 0.75 for legs 1-3-5. To achieve a cyclic 
motion of the leg, the phase orientation values (01) are chosen by 
subtracting 0.25 to the phase elevation values (02 ), plus a 0.5 shift 
for legs 3-4-5 that are on the left side of the robot. 



Leg number 


1 2 3 4 5 


a] 

Orientation , , 

01 

Elevation 

02 


1.00 1.00 1.00 1.00 1.00 1.00 
0.00 0.50 0.00 0.00 0.50 0.00 
0.25 0.25 0.25 0.25 0.25 0.25 
0.25 0.75 0.25 0.75 0.25 0.75 



Table 5: Parameters of the reference controller. 



method (section D.2 1, a random perturbation c' from a controller 
c is obtained as follows: 

• each parameter cj is obtained by adding to cj a random de- 
viation 6j, uniformely picked up in {—0.25, 0, 0.25}; 

• if Cj is greater (resp. lower) than 1 (resp. 0), it takes the value 
1 (resp. 0). 

The process is iterated during 20 minutes to match the median 
duration of the T-Resilience (Table [S] variant time). For compari- 
son, the best controller found after 25 real tests is also kept (vari- 
ant tests). 

Algorithm 2 Stochastic local search (T real tests) 

c ^ random controller 

f or i = 1 ^ T do 

c ^ random perturbation of c 

UJ='real{c) > Treal{c) then 

end if 
end for 



D. Implementation details 



new controller: c 



D.1. Local search 

Our implementation of the local search (algorithm |2]) starts from 
a randomly generated initial controller. A random perturbation 
c is derived from the current best controller c. The controller c 
is next tested on the robot for 3 seconds and the corresponding 
performance value Treai {c) is estimated with a SLAM algorithm 
using the RGB-D camera. If c performs better than c, c is replaced 
by c' , else c is kept. 

For both the stochastic local search and the policy gradient 



D.2. Policy gradient method 

Our implementation of the policy gradient method is based 
on |Kohl and Stone] < |2004| (algorithm|3j. It starts from a randomly 
generated controller c. At each iteration, 15 random perturba- 
tions c"' from this controller are tested for 3 seconds on the robot 
and their performance values are estimated with the SLAM al- 
gorithm, using the RGB-D camera. The number of random per- 
turbations (15) is the same as in jKohl and Stone) [2004| >, in which 
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only 12 parameters have to be found. For each control parameter 
j, the average performance A-^j (resp. A or A-j) of the con- 
trollers whose parameter value Cj is greater than (resp. equal to 
or less than) the value of Cj is computed. If A is not greater than 
both A-^j and A-j, the control parameter cj is modified as fol- 
lows: 

• Cj is increased by 0.25, if A-^j > A-j and cj < 1; 

• Cj is decreased by 0.25, if A-j > A-^j and Cj > 0. 

Once all the control parameters have been updated, the newly 
generated controller c is used to start a new iteration of the algo- 
rithm. 

The whole process is iterated 4 times (i.e. 60 real tests; variant 
tests) with a median duration of 24 minutes to match the median 
duration of the T-Resilience (Table [SJ. For comparison, the best 
controller found after 2 iterations (i.e. 30 real tests; variant time) 
is also kept. 



Top view Front view 




Figure 12: The six possible actions of a leg that can be tested on 
the robot: (l,a), (2, a), (l,b), (2, b), (l,c), (2, c). 



Algorithm 3 Policy gradient method {T x S real tests) 
c ^ random controller 

f or i = 1 ^ T do 

{c^,c'^, . . . ,c^} ^ S random perturbations of c 
f or j = 1 ^ 5' do 

Aoj — average of Treai {c"') for c"' such as c- — Cj 
A-^j — average oi Treai{c"') for c"' such as c'- > cj 
A- J = average of J^reai{c'^) for c'^ such as c'j < Cj 
if Aqj > max(A+,j, then 

Cj remains unchanged 
else 

ifA^j > A- J then 



else 



min(c^- + 0.25, 1) 
= max(c^- - 0.25, 0) 



end if 
end if 
end for 
end for 

new controller: c 



D.3. Self-modeling process (Bongard's algorithm) 

Our implementation of the self-modeling process is based on 



Bongard et al. 



Bongard et al. 



{ 2006 j (algorithm[4|. Unlike the implementation of 



{ 2006 >, we use internal measurements to assess the 



consequences of actions. This measure is performed with a 3-axis 
accelerometer (ADXL345) placed at the center of the robot, thus 
allowing the robot to measure its orientation. 



Action set. As advised by Bongard < |2007^ (variant II), we use 
a set of actions where each action uses only one leg. The first 
servo has 2 possible positions (1,2): — 7r/6 and tt/G. For each of 
these two positions, we have 3 possible actions (a,b,c) as shown 
on Figure [l2] There are consequently 6 possible actions for each 
leg, that is, 36 actions in total. 



Parameters. A population of 36 models is evolved during 2000 
generations. The initial population is randomly generated for the 
initial learning scenario. For other scenarios, the population is 
initialized with the self-model of the undamaged robot. A new 
action is tested every 80 generations, which leads to a total of 25 
actions tested on the real robot. Applying a new action on the 
robot implies making an additionnal simulation for each model 
at each generation, leading to arithmetic progression of the num- 
ber of simulation needed per generation. Moreover, 36 x 36 ad- 
ditionnal simulations are needed each time a new action has to 
be selected and transferred (the whole action set applyed to the 
whole population). In total, about one million simulations have 
been done per run ((25 X 26/2) X 36x80 + 36 X 36x25 = 968400). 

The self-modeling process is iterated 25 times (i.e. 25 real tests; 
variant tests) before the optimization of controllers occurs, which 
leads to a median duration of 250 minutes on overall. (Table [6} . 
For comparison, the best controller optimized with the self-model 
obtained after 25 minutes of self-modeling is also kept (i.e. after 
11 real tests; variant time). 



Robot's model. The self-model of the robot is a dynamic sim- 
ulation of the hexapod built with the Open Dynamics Engine 
(ODE); it is the same model as the one used for T-Resilience exper- 
iments. However this self -model is parametrized in order to dis- 
cover some damages or morphological modifications. For each 
leg of the robot, the algorithm has to find optimal values for 5 
parameters: 

• length of middle part of the leg (float) 

• length of the terminal part of the leg (float) 

• activation of the first actuator (boolean) 

• activation of the second actuator (boolean) 

• activation of the last actuator (boolean) 

The length parameters have 6 different values: {0, 0.5, 0.75, 
1, 1.25, 1.5}, which represents a scale factor with respect to the 
original size. If the length parameter of one part is zero, the part 
is deleted in the simulation and all other parts only attached to it 
are deleted too. We therefore have a model with 30 parameters. 



E. Median durations and number of tests 



Algorithms 


Median duration 
(min.) 


Median number 
of real tests 


Local search 


20 (10) 


50 (25) 


Policy search 


25 (13) 


60 (30) 


T-Resilience 


19 (19) 


25 (25) 


Self-modeling 


250 (250) 


25 (25) 



Table 6: Median duration and median number of real tests on the 
robot during a full run for each algorithm, for the ''time'' 
variant. Number in parenthesis correspond to the "tests" 
variant. 
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Algorithm 4 Self-modeling approach (T real tests) 



popmodei ^ {m^, m^, . . . , } (randomly generated or not) 

empty training set of actions Q 

f or i = 1 ^ T do 

selection of the action which maximises variance of predictions in popmodei 
execution of the action on the robot 

recording of robot's orientation based on internal measurements (accelerometer) 
addition of the action to the training set Q 
Nmodei iterations of MOEA on popmodei evaluated on Q 
end for 



(1) Self-modeling 



selection of the new self-model 

popctri ^ {c^,c^, . . . , c'^'^*^^ } (randomly generated) 

Nctri iterations of MOEA on popctri in the self-model | (2) Optimization of controllers 
selection of the new controller in the Pareto front 



F. Statistical tests 





Local search 


Policy search 


Self-modeling 


Ref. 




tests 


time 


tests 


time 


time 


tests 


A 


0.008 


0.008 


0.008 


0.008 


0.008 


0.008 


1.000 


B 


0.008 


0.016 


0.016 


0.016 


0.008 


0.008 


0.063 


C 


0.016 


0.151 


0.008 


0.008 


0.008 


0.008 


0.063 


D 


0.151 


0.548 


0.016 


0.087 


0.063 


0.008 


0.063 


E 


0.008 


0.008 










0.063 


F 


0.008 


0.063 










0.063 



Table 7: Statistical significance when comparing performances 
between the T-Resilience and the other algorithms (Ref. 
corresponds to the reference gait). P-values are computed 
with Wilcoxon rank-sum tests. 
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